Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
The course seems to be set up nicely. At least much better than some online courses about statistics/programming. I hope I’ll relearn to use R (briefly tried it about twelve years ago). RStudio seems convenient and I like the GitHub integration. Additionally, I hope to deepen my understanding of statistical methods.
My GitHub repository: https://github.com/MTurkkila/IODS-projectlink
I started with the Datacamp and afterwards th data wrangling was straightforward and easy. Thus I used the data file from my local folder in the analysis. Below are the important parts of the script. Before any analysis, ggplot2 and GGally libraries are accessed first as is the common practice.
library(ggplot2)
library(GGally)
Read the data from a local folder and check for structure and dimensions
students2014 <- read.table("data/students2014.txt", sep="\t")
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
The Data has 7 variables and 166 observations from a questionnaire. Variables deep, stra and surf are sum variables scaled to the original scale. The data only includes observations with points over 0.
Next I use the ggpair-function to show graphical overview of the data. In the actual script, I also save the plots in a local folder plots with dev.copy()-function.
p <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
In addition to plotting the dataframe, I checked the summaries of the data.
summary(students2014)
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
As attitude, stra and surf have the highest absolute correlation with points I chose those for the initial linear model and created a regression model with multiple explanatory variables and print the summary of my model.
my_model <- lm(Points ~ Attitude + stra + surf, data = students2014)
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The summary shows estimations for the parameters of the linear model and the statistical significance for those estimates. Only attitude is statistically significant (shown with the stars) and therefore I only include it to the second model
my_model2 <- lm(Points ~ Attitude, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The summaries show that now both \(\alpha\) and \(\beta\) variables are highly statistically significant. The estimates show that attitude has positive correlation with the attitude i.e. more positive attitude yields more points.
The multiple R squared shows how much of the variation in the points is explains by the attitude. In this model, it is 19%
Check for validity with diagnostic plots
par(mfrow = c(2,2))
plot(my_model2, which = c(1, 2, 5))
The model assumes that errors are normally distributed, are not correlated and have constant variance.
Overall, I’d say the model is quite reasonable.
The data includes background information and alcohol consumption of Portuguese students. The data is joined from two original data sets of student performance in math and in Portuguese language. The joined data includes two new variables: alc_use and high_use. alc_use is mean of weekend alcohol use (Walc) and workday alcohol use (Dalc). Variable high_use is boolean value with condition: high_use > 2.
Below is list of the variables in the data. The complete description of original data can be found at UCI Machine Learning repository
alc <- read.csv("data/alc.csv")
variable.names(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The purpose of this analysis is to examine relationship between alcohol consumption and free time. Four variables related to the free time and thus chosen for the analysis are:
The hypothesis is that more free time and going out a student has greater the chance for high alcohol consumption.
Firstly, to study distributions of the chosen variables, let’s draw bar plots:
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
alc4 <- c("freetime", "traveltime", "goout","activities")
gather(alc[alc4]) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Free time and going out seems to be quite normally distributed. Home to school travel time is heavily skewed to short, under 15 minutes travel times. Extra-curricular activities is split quite evenly. Thus, it can be assumed that variables free time and going out has more significance in alcohol consumption.
Let’s also do simple cross-tabulation of high alcohol consumption with median of free time and going out.
alc %>% group_by(high_use) %>% summarise(count = n(), going_out = median(goout), freetime = median(freetime))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 4
## high_use count going_out freetime
## <lgl> <int> <dbl> <dbl>
## 1 FALSE 268 3 3
## 2 TRUE 114 4 3
Median for going out is higher for the group that goes out more. It might be that free time in itself does not affect alcohol consumption as much. However let’s start the logistic model with all four variables as is instructed.
To build the logistic model I use glm-function with all four variables and print the summary of the model
m <- glm(high_use ~ freetime + traveltime + goout + activities, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ freetime + traveltime + goout + activities,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5791 -0.7737 -0.5840 0.9795 2.3555
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1598 0.6076 -6.847 7.56e-12 ***
## freetime 0.1524 0.1340 1.137 0.2553
## traveltime 0.4226 0.1723 2.452 0.0142 *
## goout 0.7225 0.1213 5.959 2.54e-09 ***
## activitiesyes -0.3633 0.2446 -1.485 0.1376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 406.48 on 377 degrees of freedom
## AIC: 416.48
##
## Number of Fisher Scoring iterations: 4
As expected, going out is significant predictor for high alcohol consumption. Surprisingly, also travel time has statistical signifigance. Let’s build new model with these two variables.
m <- glm(high_use ~ traveltime + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ traveltime + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4676 -0.8436 -0.6050 1.0674 2.3813
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.9443 0.4973 -7.932 2.16e-15 ***
## traveltime 0.4142 0.1712 2.419 0.0156 *
## goout 0.7553 0.1165 6.485 8.87e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 409.76 on 379 degrees of freedom
## AIC: 415.76
##
## Number of Fisher Scoring iterations: 4
Compute odds ratios OR and confidence intervals CI and print them.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01936554 0.007015459 0.04951509
## traveltime 1.51309541 1.083767881 2.12550643
## goout 2.12821744 1.703804423 2.69227727
As the odds ratio is higher that 1 for both variables, it means that both variables are positively associated with the high alcohol consumption.
To explore the predictive power of the logistic model m, I use predict() function to make predictions for high alcohol consumption. The probabilities from the predictions are added to the alc dataframe as well as boolean value with the same condition, >0.5 as the observational data. Finally we print cross tabulation of predictions and actual values.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 21
## TRUE 78 36
Let’s also plot the actual values and the predictions
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
Looks like there are quite many false predictions so let’s check the training error by first defining loss function
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
Next we call that loss function to compute percentage of wrong predictions
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2591623
The result shows that around 26% of the predictions are wrong. The model is not very good at predictions, but it still is somewhat better than simply fifty-fifty guessing.
Let’s use cv.glm() function form ‘boot’ library for K-fold cross-validation and start with 10-fold cross-validation. The function returns vector ‘delta’ which first component is the estimate for prediction error. The cv.glm() function uses the previously defined loss functions as the cost function.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2617801
The predictor error is slightly higher than the training error and it is also the close to the error in the datacamp exercise. Let’s check if it is possible to find logistic model with smaller error.
The next bit of code uses ten different logistic regression models with different number of predictor variables. In a for loop prediction probabilities are computed for each model. Additionally, 10-fold cross-validation is performed. Finally, training and prediction errors are plotted against number of predictors
# Dataframe for training and prediction errors
errors <- data.frame(matrix(ncol=3,nrow=0, dimnames=list(NULL, c("n_pred", "training", "prediction"))))
# Models with different number of predictors
m1 <- high_use ~ goout + traveltime + freetime + activities + studytime + paid + romantic+G1+G2+G3
m2 <- high_use ~ goout + traveltime + freetime + activities + studytime + paid + romantic+G1+G2
m3 <- high_use ~ goout + traveltime + freetime + activities + studytime + paid + romantic+G1
m4 <- high_use ~ goout + traveltime + freetime + activities + studytime + paid + romantic
m5 <- high_use ~ goout + traveltime + freetime + activities + studytime + paid
m6 <- high_use ~ goout + traveltime + freetime + activities + studytime
m7 <- high_use ~ goout + traveltime + freetime + activities # The original model
m8 <- high_use ~ goout + traveltime + freetime
m9 <- high_use ~ goout + traveltime # The updated model used in assignments
m10 <- high_use ~ goout
for(i in 1:10) {
tmp <- paste0("m",i)
m <- glm(tmp, data = alc, family = "binomial")
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
errors[i, "n_pred"] <- 11-i
errors[i, "training"] <- loss_func(class = alc$high_use, prob = alc$probability)
errors[i, "prediction"] <- cv$delta[1]
}
g <- errors %>% gather(key,error, training, prediction) %>%ggplot(aes(x=n_pred, y=error, colour=key))
g + geom_line() + scale_x_discrete(limits=c(1:10), name = "Number of predictors")
Both errors decrease slightly when number of predictors increases, but the trend is not linear or even stable. Also the absolute amount does not change that much only from 26% to 22% in training error between 2 (the model used in th exercises) and 6 predictors. If I check the summary of that six predictor model, only studytime has statistical significance in addition to the 2 predictor model. Of course, it would be possible to continue with this exploration and maybe even find smaller training and prediction error. Then we would be close to machine learning and that’s a topic for another time.
m <- glm(m5, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = m5, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6671 -0.7784 -0.5273 0.8741 2.4554
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1062 0.7143 -4.348 1.37e-05 ***
## goout 0.7234 0.1218 5.941 2.84e-09 ***
## traveltime 0.3829 0.1754 2.183 0.029070 *
## freetime 0.1199 0.1370 0.875 0.381614
## activitiesyes -0.2689 0.2513 -1.070 0.284700
## studytime -0.5986 0.1717 -3.487 0.000488 ***
## paidyes 0.4775 0.2549 1.873 0.061055 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 391.34 on 375 degrees of freedom
## AIC: 405.34
##
## Number of Fisher Scoring iterations: 4
In this chapter I look into linear discriminant analysis and k-means clustering using ready made data set from the MASS-library. please also see bonus, tho not ready yet.
Access the required libraries
library(MASS); library(tidyr); library(ggplot2); library(corrplot)
First I load the Boston data set and explore it with str and summary.
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The MASS package contains data sets to accompany book “Modern Applied Statistics with S” by W. N. Venables and B. D. Ripley with several distinct data from different sources. The Boston data set contains Housing values in suburbs of Boston. The set includes, for example, variables crim that is the per capita crime rate by town and age that is proportion of owner-occupied units built prior to 1940. For complete descriptions please see (Boston {MASS})[https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html].
Next, let’s plot distributions of the variables. For fun, let’s use color of the Faculty of Educational Sciences.
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill="#fcd116", color="#fcd116", alpha=0.9)
From the bar charts it is apparent that the data is not normally distributed and, in most cases, strongly skewed.
To asses different relationships of the variables let’s plot correlation matrix. I think lower triangular matrix looks better so let’s use type = "lower" instead upper as it was in the datacamp exercises.
cor_matrix <-cor (Boston)
cor_matrix <- cor_matrix %>% round(2)
corrplot(cor_matrix, method="circle", type="lower", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The per capita crime rate most strongly correlates positively with variables
rad and tax that are index of accessibility to radial highways and full-value property-tax rate per $10,000 respectively. These two variables are also strongly correlated together. I can not reason for the connection between these variables ans the crime rate.
For the following analyzes we need to standardize the data I use the scale function. The function subtracts column mean from each value and then divides it with the standard deviation: \(scaled(x) = \frac{x-\bar{x}}{\mu_x}\)
We can see this effect with the summary.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
The scaling does not change the distribution of the data but scales all variable values around zero (mean).
The scaled data set is a matrix instead of dataframe, so let’s change it back. Replotting distributions, we see that the actual distributions do not change, but the values are
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
Next, I create categorical variable of the crime rate by first dividing the data into four (quantile) bins and then cutting the data to those bins and giving each bin a label. Also, original crim variable is removed and the categorical variable added to the scaled data set.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Additionally, I divide the data set to train and test sets for future use. To select 80% of the data randomly as the train set, I use sample to randomly select n*0.8 indexes from the scaled data set. Then, I use those index to select the data. Using [-ind,] selects the rows not in [ind,]. Lastly, I save the correct crime classes form test set and remove them from that set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
More about LDA in this StatQuest-video
For the linear discriminant analysis I use the categorical crime rate created above as the target variable and all other variables as predictors. Data for the fit is the train set previously cut from the boston_scaled dataframe. Next, I draw the LDA plot.
palette(c("#fcd116","#00a6ed","#f6511d", "#7fb800")) # Custom color palette
lda.fit <- lda(crime ~., data = train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
Now, it is possible to predict crime rate categories using the LDA fit. Let’s check predictions from the test set agains the correct classes and cross tabulate the results.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 9 1 0
## med_low 4 18 6 0
## med_high 0 6 17 2
## high 0 0 1 25
The predictions are quite good as most values are on the diagonal. Only med_low gets several false positives.
Before k-means clustering, I standardize the orginial Boston data set and calculate euclidean and manhattan distance.
Boston <- scale(Boston)
Boston <- as.data.frame(Boston)
dist_eu <- dist(Boston)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
dist_man <- dist(Boston, method = "manhattan")
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
The euclidean distances sre clearly shorter as, I think, they should be.
Next I do k-means clustering with three centers and draw correlation plots, with data point colors corresponding with clusters, of a few intresting variables.
km <-kmeans(Boston, centers = 3)
pairs(Boston[,c(1,9, 10, 12, 13, 14)], col = km$cluster)
To determine optimal number of clusters I calculate and visualize within cluster sum of squares (WCSS) as function of clusters. As kmeans assigns initial clusters randomly, I set constant seed used in random number generator.
set.seed(123) # the seed for rng
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) # these lines form datacamp exercise
qplot(x = 1:k_max, y = twcss, geom = 'line') + scale_x_continuous(breaks=c(1:10))
The twcss drops most drastically from one to two clusters meaning that two clusters is the optimal number of clusters.
km <-kmeans(Boston, centers = 2)
pairs(Boston[,c(1,9, 10, 12, 13, 14)], col = km$cluster)
Visually also two clusters looks better than three.
For this bonus assignment I use LDA fit for three clusters from k-means algorithm.
Boston <- dplyr::select(Boston, -chas) # need remove this variable as the lda.fit doesn't work with it when knitting
km <-kmeans(Boston, centers = 3)
lda.fit <- lda(km$cluster ~., data = Boston)
Visualization of clusters and arrows done with code from datacamp.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes, )
lda.arrows(lda.fit, myscale = 3, color="black")
With three clusters it seem like variables age, zn, tax and nox are the most influential separators.
The code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
lda.fit <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
In this chapter I look into principal component analysis (PCA) and multiple correspondence analysis (MCA) as dimension reduction techniques. PCA is used for continuous data and here I use the human data from the data wrangling exercise. MCA is done with categorical data from the FactoMineR library.
First access the libraries and load the human data. Let’s also define two custom colors for plots.
library(GGally); library(dplyr); library(corrplot); library(ggplot2); library(tidyr)
human <- read.csv("data/human2.csv", row.names = 1)
pca_colors <- c(adjustcolor("gray40", alpha.f = 0.5), "#fcd116") # transparent light gray and faculty yellow
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ edu2_ratio: num 1.007 0.997 0.983 0.989 0.969 ...
## $ lab_ratio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Life_exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Year_edu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ MMR : int 4 6 6 5 6 7 9 28 11 8 ...
## $ ABR : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Rep : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## edu2_ratio lab_ratio Life_exp Year_edu
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI MMR ABR Rep
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The data consists of eight continuous variables. As each observation i.e. row is named by a country the data does not include the country as categorical variable.
ggpairs(human)
Only the edu2_ratio and Year_edu are somewhat normally distributed and all other are clearly skewed. However, there are many significant correlation in the data. For example, edu2_ratio correlates with 5 out of 7 variables. Let’s draw correlation matrix to better visualize all correlations within this data.
cor(human) %>% corrplot(method="circle", type="lower", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Interestingly, lab_ratio and Rep are clearly much less strongly correlated that the other variables, but they do correlate with each other. The lab_ratio is ratio of females to males in the labour force and the Rep is percentage of female representative in parliament. Thus, I think, it is not surprising that these two correlate.
As the variables in the data are highly correlated, principal component analysis or PCA is applicable and can show which variables count most of the variance within the data.
Let’s begin with PCA on non-standardized human data and compute the variance percentages. Then repeat it with standardized data and plot both side-by-side with variance percentage on the labels.
par(mfrow = c(1,2))
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = pca_colors, xlab = pc_lab[1], ylab = pc_lab[2])
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
s <- summary(pca_human_std)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human_std, cex = c(0.8, 1), col = pca_colors, xlab = pc_lab[1], ylab = pc_lab[2])
The non-standardized data has one dimension that explains all of the variance. Consequently, this dimension is the GNI variable and this is because the absolute values of the variable are at least two orders f magnitude higher. Clearly, the data needs to be standardized before PCA can reveal any meaningful insights.
In the scaled PCA-biplot, the first principal component (PC1) explains 53.6% of the variance in the data. Let’s plot that but bit bigger, to make interpretation easier. Let’s also highlight Finland in the plot with small pink point.
biplot(pca_human_std, cex = c(0.8, 1), col = pca_colors, xlab = pc_lab[1], ylab = pc_lab[2])
x = pca_human_std$x["Finland",1]*2.5
y = pca_human_std$x["Finland",2]*4
points(x=x, y=y, pch = 20, col = "deeppink", cex = 0.8)
Again, the difference with variables
Rep and lab_ratio and rest of the variables is evident as those two variables mainly correlate with the second principal component (PC2). This component could describe, for example, female representation in society. As the first principal component could be some general index for standard of living as relates to life expectancy, income and education and on the other end maternal mortality and adolescent birth rate. The plot quite clearly groups countries by regions. For example, Finland is grouped with the other Nordic countries.
MCA is kinda like PCA, but for categorical data.
Access the FactoMineR library and the tea data set.
library(FactoMineR)
data(tea)
Select only the instructed columns and check summary and structure.
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
From the output it is clear that all of the six variables are categorical. Let’s draw bar plots to visualize the data.
p <- gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
p <- p + geom_bar()
p + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Variable sugar is quite even, but all other variables have distribution with one popular category.
Next, produce the MCA model (without a graph) and print the summary of the model
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The summary shows how much of variance the different dimensions explain. Let’s also visualize the MCA model.
plot(mca, invisible=c("ind"), habillage = "quali")
As the variance is similar between different dimensions, the MCA does not yield significant insight in to the data. Consequently, the dimensions of this plot are bit more challenging to interpret. Perhaps the horizontal dimension is how people drink tea that seems to be correspond with from where people by their tea. The vertical axis is more about how people drink tea.
This week we did not get exact instructions for the assignments as we have in the previous weeks as these are the final exercises and we should now know how to do the analyses. We are just told to do the analyses of chapter 8 of the MABS book using the RATS data and correspondingly chapter 9 analyses for the BPRS data effectively swapping the data sets used in the book examples. This results in some ambiguity about what to actually do as there are saome dissimilarities in the data sets. However, I try to follow the book examples and DataCamp exercises with respect to what I consider valid for the data.
Before the analysis let’s access the libraries.
library(dplyr); library(ggplot2); library(tidyr)
The data used for this part is the long form RATS data set. The data set consists of longitudinal data of weight of the rats over nine week period. The rats were dived into three groups based on their diet.
Let’s begin by reading in the long form data and re-factoring variables ID and Group. Let’s check the structure and also glimpse the data.
RATSL <- read.csv("data/RATSL.csv", )
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
Overall, the there are only 16 individual rats within those three groups. With such small number of subjects in each group, it would difficult to use any actual statistical test. However, we can use plots and summaries to describe possible patterns in the data.
Let’s first just plot each individual rat’s weight with respect to time in days.
ggplot(RATSL, aes(x = Time, y = Weight, group = ID, colour = Group)) +
geom_line() +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Weight (grams)") +
theme_classic() +
theme(legend.position = "top")
With each individual the plot might be somewhat unclear, so let’s also plot the mean weight of each group with error bars.
n <- RATSL$Time %>% unique() %>% length()
RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
ungroup() %>%
ggplot(aes(x = Time, y = mean, colour = Group)) +
geom_line(linetype = "dashed") +
geom_point(size=1) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, width=0.5)) +
theme(legend.position = c(0.8,0.8)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Mean weight (grams)") +
theme_classic() +
theme(legend.position = "top")
Overall, the mean weights are distinctly different for each group. Group one has clearly lowest weights and group three has the highest weights. Group one also has very small error bars as the weights are close together. Contrarily, group two has the largest error bars as it has most variation in the weihts between the individual rats. Actually, the weightiest rat is in groups two even tough groups three has the highest mean weights. We can also see this from the individual plots.
Even with the different starting weights of the groups, it is possible to see that, on average, each rat has gained weight during the observation period. Interpreting possible difference in weight gain in different groups and thus in diets is difficult with this plot. We can try to normalize the data set and plot the weights again.
Adding standardized weight stdweight to the data sat.
RATSL <- RATSL %>%
group_by(Group) %>%
mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
ungroup()
Plotting mean standardized weights.
RATSLS <- RATSL %>%
group_by(Group, Time) %>%
summarise( mean = mean(stdweight)) %>%
ungroup()
ggplot(RATSLS, aes(x = Time, y = mean, colour = Group)) +
geom_point(size=1) +
geom_line(linetype = "dashed") +
theme(legend.position = c(0.8,0.8)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Mean standardized weight") +
theme_classic() +
theme(legend.position = "top")
It would seem that on average groups one and three has had more weight gain than group two. Using the standardized weights is useful as is mitigates the effect of the starting weight of the rats.
After some basic plots the book has box plots and t-tests. However, box plots are not practical for the RATS data as there is so few individual rats in each group. In practice, there is really no distribution with only four rats. However, in table 8.2 of the book there are possible summary measures for growth data. Let’s choose regression coefficient and determine those for each group in the above plot.
for (group in c(1,2,3)){
print(paste("Fit for group:", group))
print(summary(RATSLS %>% subset(Group == group) %>% lm(mean ~ Time, data = .)))
}
## [1] "Fit for group: 1"
##
## Call:
## lm(formula = mean ~ Time, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.195155 -0.057337 -0.009761 0.031874 0.185846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.881389 0.069669 -12.65 4.90e-07 ***
## Time 0.026274 0.001797 14.62 1.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1159 on 9 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9551
## F-statistic: 213.9 on 1 and 9 DF, p-value: 1.406e-07
##
## [1] "Fit for group: 2"
##
## Call:
## lm(formula = mean ~ Time, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.112189 0.003262 0.010720 0.025446 0.067201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4955231 0.0341773 -14.50 1.52e-07 ***
## Time 0.0147717 0.0008814 16.76 4.29e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05686 on 9 degrees of freedom
## Multiple R-squared: 0.969, Adjusted R-squared: 0.9655
## F-statistic: 280.9 on 1 and 9 DF, p-value: 4.289e-08
##
## [1] "Fit for group: 3"
##
## Call:
## lm(formula = mean ~ Time, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37926 -0.05368 0.03869 0.07096 0.18075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90338 0.09964 -9.066 8.04e-06 ***
## Time 0.02693 0.00257 10.480 2.42e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1658 on 9 degrees of freedom
## Multiple R-squared: 0.9243, Adjusted R-squared: 0.9159
## F-statistic: 109.8 on 1 and 9 DF, p-value: 2.418e-06
Let’s also plot those regression lines.
ggplot(RATSLS, aes(x = Time, y = mean, colour = Group)) +
geom_point(size=1) +
theme(legend.position = c(0.8,0.8)) +
scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10)) +
scale_y_continuous(name = "Mean standardized weight") +
theme_classic() +
theme(legend.position = "top") +
geom_smooth(method = "lm", se = FALSE)
Now, it is clear that group two differs from groups one and three that are quite similar with each other. Group two has lower correlation coefficient that groups one and three that have almost same coefficient. This can be seen clearly also form the plot.
At this stage, it would be possible to the conduct ANOVA (t-test is inapplicable as there are three groups). However, I do not see what additional information it would provide that would be meaningful for comparing these groups.
Data for the second part comes from clinical trial in which brief psychiatric rating scale (BPRS) was measured over eighth weeks for two groups with different treatments.
Read in the data and check structure and glimpse the data.
BPRSL <- read.csv("data/BPRSL.csv")
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, week0, wee…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Let’s start by drawing some plots.
# Draw the plot
ggplot(BPRSL, aes(x = week, y = bprs, colour = subject)) +
geom_line(alpha = 0.8) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
There are quite a lot of variation in the data, but at least in treatment one there seems to be trend of bprs declining over time.
As the repeated measures of same individual are not independent of each other simple linear regression models might not be suitable. This is where linear mixed effects models come in. These models assumes that individual’s patterns of responses depends on many random effects as unobserved variables. The idea is that the correlation within the repeated measurements is caused by those unobserved variables. The book introduces two linear mixed effects models called the random intercept model and random intercept and slope model. Let’s use and compare those models for the BPRSL data using the lme4 library.
library(lme4)
Let’s first create a random intercept model BPRSL_rim where the term (1 | subject) added to the model indicates that the subject is the random term. Next, we create the random intercept and slope model BPRSL_rism by adding also the week as random term. This allows individual differences in the slope in addition to just the intercept. To compare these to models we use ANOVA.
BPRSL_rim <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
BPRSL_rism <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
anova(BPRSL_rism, BPRSL_rim)
## Data: BPRSL
## Models:
## BPRSL_rim: bprs ~ week + treatment + (1 | subject)
## BPRSL_rism: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_rim 5 2748.7 2768.1 -1369.4 2738.7
## BPRSL_rism 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The low and significant \(\chi^2\) value means that the random intercept with slope model better fits the data.
For the final model we add possible interaction between variables week and treatment and compare this to the random intercept with slope model with ANOVA.
BPRSL_risim <- lmer(bprs ~ week + treatment + (week | subject) + week*treatment, data = BPRSL, REML = FALSE)
anova(BPRSL_risim, BPRSL_rism)
## Data: BPRSL
## Models:
## BPRSL_rism: bprs ~ week + treatment + (week | subject)
## BPRSL_risim: bprs ~ week + treatment + (week | subject) + week * treatment
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_rism 7 2745.4 2772.6 -1365.7 2731.4
## BPRSL_risim 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, as the \(\chi^2\) value is low and somewhat significant, the new model fits the data better. Let’s now use this model to create and add fitted values to BPRSL and plot side-by-side the observed and fitted values.
Fitted <- fitted(BPRSL_risim)
BPRSL<- BPRSL %>% mutate(Fitted)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, colour = subject)) +
geom_line(alpha = 0.5) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(name = "Observed bprs", limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
p2 <- ggplot(BPRSL, aes(x = week, y = Fitted, colour = subject)) +
geom_line(alpha = 0.5) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme_minimal() +
theme(legend.position = "none") +
scale_y_continuous(name = "Fitted bprs", limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
cowplot::plot_grid(p1, p2, align ="h")
Now, we have linear plots with the fitted data for each individual that takes into account the possible random effects in the data. The fitted plot are surprisingly similar even tough not identical and the plots for treatment two seems very different from each other. However, the there is no reason to doubt the results.
There might be some difference in the treatments, but it cannot be argued based on these plots. Let’s plot averages of the fitted values.
BPRSL %>%
group_by(treatment, week) %>%
summarise( mean = mean(Fitted)) %>%
ungroup() %>%
ggplot(aes(x = week, y = mean, colour = treatment)) +
geom_point(size=1) +
theme_classic() +
theme(legend.position = "top") +
scale_y_continuous(name = "Mean fitted bprs") +
geom_smooth(method = "lm", se = FALSE)
Now it would seem that, on average, treatment two could have been more efficient.
Endnote, as we did the fit for each individual, I’m rather confident that this averaging of the values is quite good at describing the overall temporal trend in the data. It would be possible continue with t-tests or ANOVAs with or without the week0 as baseline, but it was done in the datacamp exercises and not actually not asked. And as it is late, I do not continue with those even tough it would be quite interesting.